Introduction to Open Data Science - Course Project

Tuomas Keski-Kuha 6.11.2021

Here’s a link to my IODS-project

Hi!

Nice to be on the Intro to Open Data Science course. I work in the risk management field in the insurance business. I studied applied mathematics and statistics at the University of Helsinki some 10 years ago. I expect to learn the basics of open data science. I found this course on the MOOC courses page.

Looking forward to learning more about R. My goal is to learn enough R and statistics to be able to use them in my work, and perhaps also as a hobby! :D So I'm starting with happy feelings, but in a bit of a hurry to finish the first week's task.


Regression and model validation

Tuomas Keski-Kuha 13.11.2021

Exploring the data

The data comes from a questionnaire study whose intention was to examine how different variables affect students' exam points.

# first let's read the data: 
learning2014 <- read.csv(file = "data/learning2014.csv", 
                      stringsAsFactors = TRUE)

# let's study the data structure etc. 
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21

In the data we have 166 observations (students) and 7 variables. Some of the variables describe the students (gender, age, attitude), while deep, stra and surf summarize the students' questionnaire answers. Points is the exam score and the response variable.

Now let's dig deeper into the data and get an overview of it:

# the ggplot2 library is already installed; let's load it
library(ggplot2)

#summarise data
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
# let's also use library GGally

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# then let's draw a plot where also the possible dependencies can be seen

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

The plot matrix gives a good overall feel for the data. Gender determines the coloring, so all the information can be compared between the genders. The diagonal shows the distribution of each variable.

Here are a few remarks on the distributions:

  • Age is concentrated around typical student ages (under 30), but there are still quite a few older students in the study.
  • All variables show significant variability.
  • There seems to be an attitude difference between the genders.

Here are a few remarks on the correlations:

  • Attitude has the strongest correlation with points.
  • Stra also has a small positive correlation with points, while surf has a small negative one.
  • All other correlations are fairly small, but some of them differ between the genders, for example the correlations involving the surf and deep answers for males.
  • Attitude seems to be gender related.

Linear model fitting

Next, let's fit a linear model to the data. The response (dependent) variable, i.e. the variable that shows the effect, is points; all the others are predictors (independent variables), i.e. the variables assumed to cause the effect.

Let's choose the three most relevant independent variables by looking at the largest absolute correlations with the target variable (points). Attitude, stra and surf have the three largest absolute correlation values.
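This choice can be checked directly in R; a quick sketch, assuming the learning2014 data frame read in above:

```r
# correlations of the numeric variables with points, sorted by absolute value
# (gender is a factor, so only the numeric columns are used)
cors <- cor(learning2014[, sapply(learning2014, is.numeric)])[, "points"]
sort(abs(cors[names(cors) != "points"]), decreasing = TRUE)
```

The three variables at the top of this list are the ones picked for the model.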

 # Let's create a linear regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

We fitted the linear model y = a0 + a1x1 + a2x2 + a3x3. Looking at the results:

  • the constant (a0) has a fairly large estimate, but its standard error is also fairly large. Based on the corresponding p-value, the null hypothesis (a0 = 0) is unlikely (its probability is about 0.3 %)
  • the a1 value, i.e. the attitude coefficient, is statistically very significant, so the null hypothesis (a1 = 0) is very unlikely (the p-value is tiny)
  • the other parameters a2 and a3 do not have significant p-values, so it seems reasonable not to keep x2 and x3 in the model (the null hypothesis a2 = a3 = 0 is plausible)
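The same conclusions can be read off from confidence intervals for the coefficients; a small sketch, assuming the my_model object fitted above:

```r
# 95 % confidence intervals for the model coefficients:
# an interval containing zero corresponds to a non-significant parameter
confint(my_model, level = 0.95)
```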

Of course the model will fit differently if we drop only one of the non-significant variables, but here it seems that a small a2 or a3 would not change the model much, and those extra parameters could even worsen the model's prediction capability. Still, let's fit a linear model with two variables (attitude and stra):

 # Let's create a linear regression model with two independent variables
my_model_2 <- lm(points ~ attitude + stra, data = learning2014) 
summary(my_model_2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Dropping the surf variable did not change things that much, but at least a0 got a better fit. Still, the overall results are not much more impressive without surf if you look at the residual standard error and the residual values. So the best thing here may be to also drop the stra variable (the simpler, the better). The residuals seem to follow the same kind of distribution as in the previous model.

Multiple R-squared (ranging from 0 to 1) gives the proportion of the variance in the response variable (the effect) that the predictor variables (the causes) explain. A low value indicates that the variance in the response variable (points) is not well explained by the predictors (attitude and stra). Here the R-squared value is indeed unimpressive, and a lot of the variability is left unexplained.

The adjusted R-squared also did not change much between the first and the second model; its value is about 0.2. Adjusted R-squared is handier than multiple R-squared because it takes into account the number of variables in the model and adjusts the multiple R-squared accordingly (R-squared will increase as you add more variables). So with adjusted R-squared one can genuinely compare models with different numbers of predictors when considering the explained variance in the response variable.
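Adjusted R-squared can also be computed by hand from the multiple R-squared; a minimal sketch, with n the number of observations and p the number of predictors:

```r
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# second model: R^2 = 0.2048, n = 166 observations, p = 2 predictors
adj_r2(0.2048, 166, 2)  # about 0.195, matching the reported value up to rounding
```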

For the fun of it, let's also try fitting a simple line (y = a0 + a1x1), where x1 is attitude.

 # Let's create a linear regression model with just one independent variable
my_model_3 <- lm(points ~ attitude, data = learning2014) 
summary(my_model_3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

We observe that the residual standard error became a bit larger even though the parameter estimates were more precise. The adjusted R-squared also became lower, which is not good, but is the difference significant? Probably not, if you compare the residual standard errors with the previous models with multiple predictive variables.
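The nested models can also be compared formally with an F-test; a sketch, assuming the my_model_2 and my_model_3 objects fitted above:

```r
# F-test for whether adding stra significantly improves on the attitude-only model;
# a large p-value would support keeping the simpler model
anova(my_model_3, my_model_2)
```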

Graphical model validation

Let's first draw a simple regression line on the data, with attitude as the predictive variable:

# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# draw the plot
p3
## `geom_smooth()` using formula 'y ~ x'

Let’s draw diagnostic plots and choose the following plots:

  • Residual vs Fitted values
  • Normal QQ-plot
  • Residuals vs Leverage
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5

plot(my_model_3, which = c(1, 2, 5))

Comments on the plots:

  • Residuals vs Fitted values: the linearity assumption looks valid here. There is still a lot of variability, and with more observations the linearity assumption might look even stronger… or not. There are some outliers (145, 56, 35) which may affect the linearity assumption and the fitting. There is also no clear pattern in the plot, which indicates that the errors have constant variance and that the size of the errors does not depend on the predictive variables.
  • Normal QQ-plot: the normality assumption of the errors looks quite valid here. The points form a nice line, which indicates that the errors come from a normal distribution. There are some outliers which do not fit as nicely, but extreme values are rare as such. Just looking at the picture, the normality assumption is not far-fetched.
  • Residuals vs. Leverage: this plot shows how much single observations (outliers) influence the model, i.e. whether any of them have an unusually high impact. It looks like no single observation is decisive for the model: the points with high leverage do not fall far from the average standardized residual, and the outliers in the data lie close enough to the average of attitude that their influence stays small.
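The influence of single observations noted in the last plot can also be checked numerically; a sketch using the my_model_3 object from above:

```r
# the observations with the largest Cook's distances, i.e. the most influential points
head(sort(cooks.distance(my_model_3), decreasing = TRUE))
```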

Logistic regression

Tuomas Keski-Kuha 21.11.2021

Data exploration and finding relevant variables

# access a few libraries: 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)

# first let's read the data: 
alc<- read.csv(file = "https://github.com/rsund/IODS-project/raw/master/data/alc.csv", 
                      stringsAsFactors = FALSE, sep = ",")

# let's study the data structure etc. 
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",~
## $ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",~
## $ age        <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,~
## $ address    <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",~
## $ famsize    <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE~
## $ Pstatus    <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",~
## $ Medu       <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,~
## $ Fedu       <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,~
## $ Mjob       <chr> "at_home", "other", "at_home", "services", "services", "ser~
## $ Fjob       <chr> "other", "other", "other", "health", "services", "health", ~
## $ reason     <chr> "home", "reputation", "reputation", "course", "reputation",~
## $ guardian   <chr> "mother", "mother", "mother", "mother", "other", "mother", ~
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,~
## $ studytime  <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,~
## $ schoolsup  <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",~
## $ famsup     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", ~
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"~
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ internet   <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes~
## $ romantic   <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n~
## $ famrel     <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,~
## $ freetime   <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,~
## $ goout      <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,~
## $ Dalc       <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,~
## $ Walc       <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,~
## $ health     <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,~
## $ n          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
## $ id.p       <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,~
## $ id.m       <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,~
## $ failures   <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,~
## $ paid       <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences   <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1~
## $ G1         <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,~
## $ G2         <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14~
## $ G3         <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,~
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ paid.p     <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2~
## $ G1.p       <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,~
## $ G2.p       <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,~
## $ G3.p       <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,~
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,~
## $ paid.m     <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",~
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, ~
## $ G1.m       <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1~
## $ G2.m       <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ G3.m       <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ alc_use    <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,~
## $ high_use   <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,~
## $ cid        <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,~
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

Data set information (straight from the web page):

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

Still, we are going to use this data to predict a binary variable for alcohol consumption (high_use).

The four variables expected to affect alcohol consumption are as follows (the variable picking was biased, as I had already seen the results in the DataCamp exercise :D). Anyway, let's pick:

  • sex: it is quite typical that males drink more
  • absences: one could think that absences go up for high alcohol users due to hangovers and other side effects
  • failures: same reasoning as for absences
  • goout: the thinking here is quite obvious: someone who goes out more perhaps also drinks more

Let's draw some plots and tables to see whether there might be a relationship between the target variable (high_use) and the predictors.

# produce summary statistics by group for studying relationship between high_use and sex
alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70
# let's draw a plot to study how absences might relate to the target variable
g1 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

g1 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

  • sex: It looks like there is a big difference in high alcohol use between the sexes. Males tend to use much more, as expected.
  • absences: In the plot we show absences by sex, and we can see that high users tend to be absent from classes more. Interestingly, for females the difference is not as pronounced as for males.
# initialize a plot of high_use related to goout
g2 <- ggplot(data = alc, aes(x = goout, fill = high_use))

# define the plot as a bar plot and draw it
g2 + geom_bar() + ggtitle("Student go-out-tendency by high alcohol consumption")

#  let's count few crosstables for examining high_use relatedness to failures
table(high_use = alc$high_use, failures = alc$failures)
##         failures
## high_use   0   1   2   3
##    FALSE 238  12   8   1
##    TRUE   87  12   9   3
  • goout: The bar chart shows that students who go out more also use more alcohol. The effect is quite clear, and this was quite expected.
  • failures: High users show a tendency to have failed classes in the past, as expected, but this variable is tricky because there are so few instances with 1 to 3 failed classes. That could disturb modeling with this feature.

Logistic regression model

In this chapter, let's fit a logistic regression model to predict high alcohol use. For the model we use the variables that seemed to have an effect on high alcohol use (sex, goout, absences, failures).

# find the model with glm() (target high_use)
m <- glm(high_use ~ sex + goout + absences + failures, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + goout + absences + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1521  -0.7929  -0.5317   0.7412   2.3706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## absences     0.08246    0.02266   3.639 0.000274 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4

The model summary seems to tell that the predictive variables fit well overall and are statistically significant (the others have p-values < 0.001, but for failures the p-value is 0.03, which is less significant, so it is reasonable to expect it to contribute less to the prediction power). We also cannot be sure that there are no strong correlations between the predictors, so it could be wise to study those as well.

Let's also present the coefficients as odds ratios together with their confidence intervals:

# compute odds ratios (OR) round the results
OR <- coef(m) %>% exp %>% round(2)

# compute confidence intervals (CI), round the results
CI <- confint(m) %>% exp %>% round(2)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##               OR 2.5 % 97.5 %
## (Intercept) 0.02  0.01   0.04
## sexM        2.66  1.61   4.49
## goout       2.01  1.59   2.56
## absences    1.09  1.04   1.14
## failures    1.63  1.04   2.59

I did not have time to make adjustments for the intercept term, but luckily its odds ratio is near zero, so it won't matter that much in this case. The sex and goout odds ratios are well above one, which indicates that they are relevant for the probability of high alcohol use. Failures and absences also have odds ratios above one, but only slightly, so they seem less relevant for the probability of high alcohol use. Their confidence intervals, with lower bounds just above one, also point to a weaker effect.


Predictions

Let's use the logistic regression model for predicting high alcohol use from the selected data. We'll calculate the probability of high alcohol use and then form a new variable which is TRUE if the probability is bigger than 0.5 and FALSE otherwise. After that we make a cross table of the model predictions versus the actual high_use variable (the study outcomes), and also a graphical presentation of predictions vs. outcomes.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   16
##    TRUE     61   50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

The model gives sensible results: 61 + 16 wrong predictions vs. 243 + 50 correct ones.

Let's also define a loss function and use it to calculate the inaccuracy of the model. The same figure can be computed from the table in the previous paragraph (incorrect cases divided by all cases).

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
round(loss_func(class = alc$high_use, prob = alc$probability), 4)
## [1] 0.2081

So the model is correct for about 79 % of the cases and incorrect for about 21 %.

It seems hard to predict high use (the model gets it wrong 61 times vs. 50 correct), but for low use it is far more precise (16 wrong vs. 243 correct). Compared with simple guessing (a coin flip would be correct about 50 % of the time), the model gives reasonable accuracy.
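As a sanity check, the model can be compared against trivial baselines using the same loss function; a sketch, assuming the alc data and loss_func defined above:

```r
# baseline 1: always predict FALSE (no high use);
# its error equals the share of high users (about 0.30 here)
loss_func(class = alc$high_use, prob = rep(0, nrow(alc)))

# baseline 2: coin-flip predictions, wrong about half the time on average
loss_func(class = alc$high_use, prob = runif(nrow(alc)))
```

The model's error of about 0.21 beats both baselines.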


Cross-validation

Let's do a 10-fold cross-validation to validate the logistic regression model used above:

  • We divide the data into 10 equally sized segments (data groups 1, 2, 3, …, 10).
  • Then we fit the model on groups 2 to 10 (the training set) and measure the prediction error on group 1 (the test set).
  • After that we use groups 1, 3, 4, …, 10 and measure the prediction error on group 2.
  • We continue until all the data has been used this way; the last prediction error is calculated for group 10 after fitting on groups 1 to 9.
  • Finally we average all these prediction errors to get the validation figure.
# set the random seed so that we can reproduce exactly the same results
set.seed(123)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross-validation; this is the raw inaccuracy rate, not an adjusted value
round(cv$delta[1], 4)
## [1] 0.2243

So cross-validation gives a figure quite close to the prediction error we had when all the data was used for training.


Cross-validating different models

# Let's define the wanted predictor sets (just using the column numbers):

# the first test set has many predictors (almost all of them, more than 40)
testi_1 <- c(1:24, 27, 29:30,  31, 33:48, 50)
# we drop a few in the testi_2 predictor set, but it still has quite many (approx. 30)
testi_2 <- c(2:24, 29:31, 33:38, 50)
# we drop some more, still keeping the most relevant (approx 20)
testi_3 <- c(2:11, 12, 17, 24, 29:30,  31, 33:35, 50)
# now we keep just 10
testi_4 <- c(2:5, 24, 27, 30,  31, 33, 50)
# the last predictor set has only 3 columns; number 50 is high_use (the target variable)
testi_5 <- c(2, 24, 33, 50)


# build a list of the different predictor sets
testi_setti <- list(testi_1, testi_2, testi_3, 
                    testi_4, testi_5)

# initialising the result matrix
tulokset <- matrix(nrow = 2, ncol = 5)

# defining a loop which takes one testi_setti item (a predictor vector) at a time and calculates the training error on the whole data set as well as the test error from 10-fold cross-validation (this code is adapted from above)

for (i in 1:5) {
  
pred_var <- testi_setti[[i]]

# find the model with glm() (target high_use)
m2 <- glm(high_use~., data = alc[pred_var], family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# call loss_func to compute the average number of wrong predictions in the (training) data
train_err <- round(loss_func(class = alc$high_use, prob = alc$probability), 4)

set.seed(123)

# cross-validation
cv <- cv.glm(data = alc[pred_var], cost = loss_func, glmfit = m2, K = 10)

# here we'll have results for cross-validation
test_err <- round(cv$delta[1], 4)

# saving the results to tulokset matrix
tulokset[1,i] <- train_err
tulokset[2,i] <- test_err

}

tulokset
##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,] 0.1757 0.1946 0.2000 0.2162 0.2108
## [2,] 0.2784 0.2784 0.2378 0.2270 0.2162

In the tulokset matrix the first row contains the training errors on the whole data and the second row the test errors from cross-validation. The model in column 1 has almost all the predictors one can pick, and the number of predictors decreases (keeping the most relevant ones) until the fifth column, which has only three.

It can be noticed that the training error tends to go up when there are fewer predictors in the model. With more predictors the model is more complex and learns the training data better, but on the other hand its test error is large. So a model with many predictors becomes dependent on the particular data it was trained on and does not work as well on the different cross-validation sets. A good model with a few significant predictors is far more robust and simpler, and it passes the testing better.
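The pattern is easy to see when the two error rows are plotted against the model index; a small sketch using the tulokset matrix from above:

```r
# training vs. cross-validation error as the number of predictors decreases
matplot(t(tulokset), type = "b", pch = 1:2, lty = 1,
        xlab = "model (1 = most predictors, 5 = fewest)", ylab = "error rate")
legend("topright", legend = c("training error", "CV test error"),
       pch = 1:2, col = 1:2, lty = 1)
```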


Clustering and classification

Tuomas Keski-Kuha 22.11.2021


Data exploration and finding correlations


Exercise 2: Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here.


Let’s see the Boston data in the following:

# access a few libraries: 
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
library(GGally)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Data set information

The data describes housing values in the suburbs of Boston. It has 506 rows and 14 columns, with information on the suburbs (towns) of Boston. Here is the link to the data page, where one can see the full details.


Exercise 3: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.


Let’s visualize the data and take a look at the distributions of the variables and the relationships between them.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# First let's make a long table from the Boston data
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# Then plot every variable, so we'll see the distributions of variables
ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

There is significant variability in the data, as some variables are heavily skewed towards the minimum or the maximum value (most of the values are near or exactly at the minimum or maximum). There are also a few integer variables in the data, which do not fit the density plots above very well. A few of the variables seem approximately normally distributed.

It is clear that there are really strong correlations (positive and negative) between variables in the data, as the correlation plot shows. In the plot the colour and the size of the circles indicate the direction and strength of the correlations between variables.

Quite interestingly, the rad (accessibility to radial highways) and tax (property tax rate) variables have the strongest positive correlations with the crime rate in towns. One might expect variables like dis (distance to employment centres) and lstat (percentage of lower status population) to correlate more with the crime rate. The plot also shows that rad and tax are heavily correlated with each other, so they act almost like one variable.

We can also see some notable variables:

  • rad: correlates with the crime rate
  • tax: correlates with rad and the crime rate
  • indus (proportion of non-retail business acres, i.e. traditional industry): has many significant correlations with other key variables
  • nox (nitrogen oxides concentration): has many significant correlations with key variables; it seems hard to interpret in any simple way, but at least it correlates positively with indus
  • age (proportion of owner-occupied units built prior to 1940): old buildings correlate negatively with distance to employment centres and positively with indus (more old buildings in more industrialized towns)
  • dis: many significant correlations with other variables; quite logically it correlates negatively with indus, so traditional industry lies far from the main centres
  • lstat: many correlations with other variables, but perhaps none that strong

All in all, it is hard to gain much information from this correlation analysis alone.


Data preparing and making training and test datasets


Exercise 4: Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.


Let’s scale the variables so that the mean = 0 and variance = 1 for all variables in the dataset:

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We can now see that all the variables take negative and positive values around zero, with the mean exactly zero.
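As a quick sanity check of the scaling (a self-contained sketch using the same MASS::Boston data): after scale(), every column should have mean 0 and standard deviation 1.

```r
library(MASS)

# load and scale the Boston data as above
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# column means should be numerically zero and standard deviations one
round(colMeans(boston_scaled), 10)
sapply(boston_scaled, sd)
```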

In the following section we create a new categorical variable from the scaled crim variable, with labels low, med_low, med_high and high based on the quantiles, so that each category contains 25 % of the data.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

After that we split the data into training and test sets, picking 80 % of the rows at random for the training set and leaving the rest for the test set.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

set.seed(123)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Classification - Fitting the Linear Discriminant Analysis model


Exercise 5: Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.


Let’s fit the LDA model to training data and visualize results.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01866313 -0.9066422 -0.08120770 -0.8835724  0.38666334 -0.9213895
## med_low  -0.07141687 -0.3429155  0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591  0.2162741  0.19544522  0.3637157  0.12480815  0.4564260
## high     -0.48724019  1.0149946 -0.03371693  1.0481437 -0.47733231  0.8274496
##                 dis        rad        tax    ptratio      black        lstat
## low       0.9094324 -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775
## med_low   0.3694282 -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124
## high     -0.8601246  1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309
##                 medv
## low       0.47009410
## med_low   0.01548761
## med_high  0.19440747
## high     -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas     0.002460776  0.03963849  0.1699312
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The linear discriminant analysis seems to fit the data quite well, as the plot indicates. The rad variable (index of accessibility to radial highways) and also zn (proportion of residential land zoned for lots over 25,000 sq.ft.) seem to have the biggest effect on the classification. The rad variable gives really good discrimination for the high-crime towns; for the lower-crime towns the separation is not nearly as clear. In the picture the rad variable separates the high-crime towns from the rest.

Why, then, does rad (index of accessibility to radial highways) have such a significant effect? It is not obvious that good highway connections would affect the crime rate that much while other variables (like lower status population, industrialization or distance from employment centres) would not. Perhaps there is a higher population density near the highways, or something similar. In any case, rad seems to be a good indicator of a high crime rate.


Exercise 6: Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.


In the next phase we save the crime categories from the test set. Then we make predictions on the test data and see how accurate the model is:

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        4    0
##   med_low    2      17        6    0
##   med_high   1       9       13    2
##   high       0       0        1   27

The model’s predictive power seems good indeed. There are just a few bigger inaccuracies where the correct class is two classes away, and almost all results end up near the diagonal. Especially the high class seems to be “easy” to predict.
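One way to condense the table into a single number is the overall accuracy, i.e. the share of predictions on the diagonal. A small self-contained sketch, with the counts copied from the cross-tabulation above rather than recomputed:

```r
# rebuild the printed cross-tabulation as a matrix and compute the accuracy
classes <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(10, 10,  4,  0,
                  2, 17,  6,  0,
                  1,  9, 13,  2,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))

# share of test towns classified into the correct crime category
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # about 0.66, i.e. roughly two thirds correct
```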


Clustering - Fitting the K-means model


Exercise 7: Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.


# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Then, we calculate the distances and apply K-means algorithm to make clusters:

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
km <-kmeans(boston_scaled, centers = 3)

Finding the optimal value for the number of clusters:

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal value is where the TWCSS (total within-cluster sum of squares) drops radically; here the optimal number of clusters for K-means seems to be 2.
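One rough way to automate reading the elbow is to look at the relative drop in TWCSS when moving from k to k + 1 clusters. A self-contained sketch on the scaled Boston data (the "largest single drop" heuristic is my own choice here, not from the course material):

```r
library(MASS)

# load and scale the Boston data as above
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)

# relative drop in TWCSS when moving from k clusters to k + 1
rel_drop <- -diff(twcss) / head(twcss, -1)
round(rel_drop, 2)

# heuristic: choose the k right after the largest single relative drop
which.max(rel_drop) + 1
```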

# k-means clustering when k = 2
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston scaled dataset with clusters
pairs(boston_scaled, col = km$cluster)

We can see that perhaps the crim, indus, rad, tax and black variables separate the two clusters best, so let’s look at figures with only those variables.

# let's make a vector containing the most clearly separating variables
a_1 <- c(1, 3, 9, 10, 12)

# plot the Boston scaled dataset with clusters
pairs(boston_scaled[, a_1], col = km$cluster)

The presented variables seem to have the largest effect on the clustering (into two clusters), as the plot shows good separation between the black and pink groups.

Next, we draw a few more plots where these variable values are shown by cluster:

# let's draw a few more plots where the significant variable values
# are presented as boxplots by cluster (a 2 x 3 grid fits all five plots)
par(mfrow = c(2, 3))

boxplot(boston_scaled$rad ~ km$cluster, main = "Rad by clusters")

boxplot(boston_scaled$tax ~ km$cluster, main = "Tax by clusters")

boxplot(boston_scaled$black ~ km$cluster, main = "Black by clusters")

boxplot(boston_scaled$indus ~ km$cluster, main = "Indus by clusters")

boxplot(boston_scaled$crim ~ km$cluster, main = "Crim by clusters")

Perhaps these boxplots do not make one much wiser: for several of the picked variables (e.g. crim, indus, tax) the values within one of the clusters are concentrated on a narrow range, so the separating power of any single variable can be criticized. But at least we got some more nice pictures here.
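Before trusting the boxplots it is also worth checking how the 506 towns actually split between the two clusters. A quick self-contained sketch that rebuilds the clustering (the exact sizes depend on the random seed):

```r
library(MASS)

# load, scale and re-cluster the Boston data as above
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# number of towns in each of the two clusters
table(km$cluster)
```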


Classification of clusters (Bonus)


Bonus exercise: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?


# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# k-means clustering
km <-kmeans(boston_scaled, centers = 5)

Let’s add the clusters to the data and then apply linear discriminant analysis using the K-means clusters as target classes:

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)

set.seed(123)

# linear discriminant analysis
lda.fit_2 <- lda(km.cluster ~ ., data = boston_scaled)

# plot the lda results
plot(lda.fit_2, dimen = 2, col = boston_scaled$km.cluster, pch = boston_scaled$km.cluster)
lda.arrows(lda.fit_2, myscale = 1)

It seems clear that the chas and rad variables are the most influential linear separators for the five clusters. We handled rad before: highway accessibility seems to make a difference and separate the towns. Chas indicates that the riverside towns differ significantly from the non-riverside towns.


3D-plotting (Super-Bonus)


model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~train$crime)
# k-means clustering
km_2 <-kmeans(matrix_product, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~km_2$cluster)

I am not really sure what the intention with the latter 3D plot was. I applied K-means (k = 4, the same number of classes as in the first plot) to the matrix product data (the LDA projection of the training data), intending to cluster the projected points. This K-means clustering makes the data look really smoothly separated: there are no obvious outliers or data points “inside” another group as in the first plot. Looking at the second plot, it looks much the same as if you plotted the predictions of the LDA model.

